Processing long-latency instructions in a pipelined processor

ABSTRACT

There is provided a method and processor for processing a thread. The thread comprises a plurality of sequential instructions, the plurality of sequential instructions comprising some short-latency instructions and some long-latency instructions and at least one hazard instruction, the hazard instruction requiring one or more preceding instructions to be processed before the hazard instruction is processed. The method comprises the steps of: a) before processing each long-latency instruction, incrementing by one, a counter associated with the thread; b) after each long-latency instruction has been processed, decrementing by one, the counter associated with the thread; c) before processing each hazard instruction, checking the value of the counter associated with the thread, and i) if the counter value is zero, processing the hazard instruction, or ii) if the counter value is non-zero, pausing processing of the hazard instruction until a later time. The processor includes means for performing steps a), b) and c) of the method.

FIELD OF THE INVENTION

The invention relates to a method for processing a thread in a pipelined processor and to such a pipelined processor. Particularly, but not exclusively, the invention relates to a method for processing a plurality of threads in a multi-threaded pipeline processor and to such a multi-threaded pipeline processor.

BACKGROUND OF THE INVENTION

In computer architecture, a data hazard is a problem that can occur in a pipelined processor. Instructions in a pipelined processor are performed in several stages, so that, at any given time, several instructions are being executed and the instructions may not be completed in the desired order. A data hazard occurs when two or more of these simultaneous, and possibly out-of-order, instructions conflict and cause an error.

Data hazards occur when data is modified. A data hazard can occur in the following situations: 1) Read after Write (RAW): An operand is modified and read soon after. Because the first instruction may not have finished writing to the operand, the second instruction may use incorrect data; 2) Write after Read (WAR): Read an operand and write soon after to that same operand. Because the write may have finished before the read, the read instruction may incorrectly get the new written value; and 3) Write after Write (WAW): Two instructions that write to the same operand are performed. The first one may finish second and therefore leave the operand with an incorrect data value. The operands involved in data hazards can reside in memory or in a register.

A pipelined processor's instruction set may contain special instructions which have exceptionally high latencies relative to standard instructions. A primary example would be an instruction which fetches data from memory. The problem of data hazards is relatively easy to avoid for low latency instructions, i.e. instructions that can be completed in a small number of clock ticks, because it is then relatively easy to ensure that the instructions within a particular thread are completed in the issued order. However, when high latency instructions are included in a thread, the problem of data hazards is more significant because there is more likelihood that the instructions in a particular thread will not complete in the issued order.

These problems arise in all sorts of circumstances, e.g. in 3D graphics processors, in Central Processing Units (CPUs) including dedicated media CPUs in which real time inputs are being received, and in communication with multi-processor systems.

To deal with the high latency instructions, the processor should ideally provide a mechanism to swap out a thread which is waiting for instructions to complete. However, certain requirements also have to be fulfilled.

Firstly, in a multi-threaded processor, many threads might have potential data hazards, i.e. instructions which depend upon preceding instructions being completed before they are processed.

Secondly, each thread might have a large number of long latency instructions, which could all be adjacent in the stream. It must be possible to allow the return data from long latency instructions to come back in a different order from that in which the instructions were dispatched. Given that there could be a number of long latency instructions being processed at one time, processor stalling due to data hazards from long latency instructions must be reduced as much as possible.

Thirdly, it has to be possible to skip over any instructions in the thread where there is a branch in the thread, especially those which might cause a data hazard, because they depend upon preceding instructions being completed before they are processed.

Fourth, it must be possible to read results in a different order than they were written. Fifth, there shall be no penalty for multiple read accesses of destinations. Sixth, it also must be permitted for the same destination to be written to and re-used as a destination for another long latency instruction.

Finally, it is preferable that no dedicated or mass storage is needed in processing the long latency instructions and potential data hazard instructions. It is also preferable that gate costs are kept to a minimum.

It is an object of the invention to provide a method and apparatus for processing threads which mitigates or overcomes the problem of data hazards in long latency operations.

SUMMARY OF THE INVENTION

According to a first aspect of the invention, there is provided a method for processing a thread in a pipelined processor, the thread comprising a plurality of sequential instructions, the plurality of sequential instructions comprising some short-latency instructions and some long-latency instructions and at least one hazard instruction, the hazard instruction requiring one or more preceding instructions to be processed before the hazard instruction is processed, the method comprising the steps of: a) before processing each long-latency instruction, incrementing by one, a counter associated with the thread; b) after each long-latency instruction has been processed, decrementing by one, the counter associated with the thread; c) before processing each hazard instruction, checking the value of the counter associated with the thread, and i) if the counter value is zero, processing the hazard instruction, or ii) if the counter value is non-zero, pausing processing of the hazard instruction until a later time.

Thus, if the counter is non-zero, which means that one or more long latency instructions are being processed and are still outstanding, the hazard instruction is put on hold. This means that there is no possibility that the hazard instruction could be processed before the preceding instructions on which it depends, thereby causing a data hazard. Short-latency instructions are those instructions which can be completed within a certain, predetermined number of clock ticks. Long-latency instructions are those instructions which require more than the predetermined number of clock ticks to be completed.
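By way of illustration only, steps a) to c) for a single thread might be modelled in software as follows. This is a minimal sketch, not the claimed hardware; the class name `ThreadHazardCounter` and its method names are hypothetical and do not appear in the description.

```python
from dataclasses import dataclass


@dataclass
class ThreadHazardCounter:
    """Hypothetical software model of the counter associated with one thread."""

    value: int = 0  # number of long-latency instructions still outstanding

    def before_long_latency(self) -> None:
        # Step a): increment by one before the long-latency instruction is processed.
        self.value += 1

    def after_long_latency(self) -> None:
        # Step b): decrement by one after the long-latency instruction has been processed.
        if self.value == 0:
            raise RuntimeError("completion reported with no outstanding instruction")
        self.value -= 1

    def may_process_hazard(self) -> bool:
        # Step c): process the hazard instruction only if the counter is zero;
        # otherwise the caller pauses it until a later time.
        return self.value == 0


counter = ThreadHazardCounter()
counter.before_long_latency()        # e.g. a memory load is issued
print(counter.may_process_hazard())  # False: the load is still outstanding
counter.after_long_latency()         # the load result has been written back
print(counter.may_process_hazard())  # True: the hazard instruction may proceed
```

The counter records only how many long-latency instructions are outstanding, not which ones, which is why no per-instruction storage is needed in this model.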

Preferably, the method is for processing a plurality of threads, each thread having a respective counter associated therewith. The plurality of threads are the threads which may be resident at any one time. In a preferred embodiment, the number of resident threads is 16.

Preferably, the or each thread in the processor is, at any one time, either being processed, or waiting to be processed, or paused in accordance with step c) ii).

Preferably, at any one time, a subset of the plurality of threads is being processed. In one preferred embodiment, the number of threads in the subset is 4.

Preferably, the method further comprises processing the subset of threads by executing one instruction from each thread in the subset in a round robin manner.

Advantageously, the number of threads in the subset is equal to the maximum number of clock ticks required to process a short-latency instruction. In this way, for short latency instructions, there is no possibility of data hazards.

In one embodiment, the method further comprises: after processing the final instruction of a thread, removing the thread from the subset of the plurality of threads. Thus, once a thread has been completely processed, there is a space available in the subset.

Preferably, the method further comprises the step of: periodically checking the value of the counter associated with any threads having instructions that have been paused in accordance with step c) ii), and, if the value of the counter of a thread is zero, transitioning that thread to the waiting to be processed state. Once the counter of the thread has reduced to zero, there are no long latency instructions still outstanding, so the pausing of the thread (in accordance with step c) ii)) can be removed. In one embodiment, the step of checking is carried out on every clock tick.

Preferably, the processor is arranged to process any number of threads up to the plurality of threads, such that, at any one time, zero, one or more of the plurality of thread locations are empty.

Preferably, the or each thread has a plurality N of respective counters associated therewith and step c) of the method comprises: before processing each hazard instruction, checking the value of at least one of the N counters associated with the thread; and i) if all of the values of the at least one counters are zero, processing the hazard instruction, or ii) if one or more of the values of the at least one counters are non-zero, pausing processing of the hazard instruction until a later time.

Then, preferably, each long-latency instruction includes an indication of which of the at least N counters should be incremented before the long-latency instruction is processed and decremented after the long-latency instruction is processed. Also preferably, either each hazard instruction is preceded by an instruction that includes an indication of which of the N counters should be checked before the hazard instruction is processed, or each hazard instruction itself includes an indication of which of the N counters should be checked before the hazard instruction is processed.

With this arrangement, use of the N hazard counters can be optimised. For example, a particular hazard instruction might depend on a first long latency instruction being processed, but not on a second long latency instruction. In that case, the first long latency instruction may include an indication that the nth counter should be incremented before it is processed, and the second long latency instruction may include an indication that the mth counter should be incremented before it is processed. Then, the hazard instruction may include an indication that only the nth counter of the N counters should be checked before it is processed.

There is also provided a computer program which, when run on computer means, causes the computer means to carry out the method of the first aspect of the invention. There is also provided a record carrier having stored thereon such a computer program.

According to a second aspect of the invention, there is provided a pipelined processor for processing a thread, the thread comprising a plurality of sequential instructions, the plurality of sequential instructions comprising some short-latency instructions and some long-latency instructions and at least one hazard instruction, the hazard instruction requiring one or more preceding instructions to be processed before the hazard instruction is processed, the processor comprising: a counter associated with the thread; means to increment the counter by one, before each long-latency instruction is processed; means to decrement the counter by one, after each long-latency instruction has been processed; and means for checking the value of the counter associated with the thread, before each hazard instruction is processed, and i) if the counter value is zero, processing the hazard instruction, or ii) if the counter value is non-zero, pausing processing of the hazard instruction until a later time.

In one preferred embodiment, the means to increment the counter and to decrement the counter comprises an instruction decoder, the instruction decoder being able to distinguish between short-latency instructions and long-latency instructions.

In one preferred embodiment, the processor further comprises a thread manager and the counter is maintained by the instruction decoder but can be accessed by the thread manager.

Preferably, the means for checking the value of the counter associated with the thread before a hazard instruction is processed comprises the instruction decoder, the instruction decoder being able to distinguish between the hazard instructions and the remaining instructions.

Preferably, the processor is suitable for processing a plurality of threads, each thread having a respective counter associated therewith. In one preferred embodiment, there are 16 threads in the plurality of threads.

Preferably, the or each thread in the processor is, at any one time, either being processed, or waiting to be processed, or paused in accordance with ii).

Preferably, the thread manager keeps track of the state of each resident thread in the processor.

Preferably, at any one time, a subset of the plurality of threads is being processed. In one preferred arrangement the number of threads in the subset is 4.

In one embodiment, the processor further comprises means for processing the subset of threads by executing one instruction from each thread in the subset in a round robin manner.

Advantageously, the number of threads in the subset is equal to the maximum number of clock ticks required to process a short-latency instruction. Thus, there will be no data hazard problems for short latency instructions.

Preferably, the processor is arranged, after processing the final instruction of a thread, to remove the thread from the subset of the plurality of threads. This transitioning between states is preferably performed by the thread manager.

Preferably, the processor is arranged to periodically check the value of the counter associated with any threads having instructions that have been paused in accordance with ii), and, if the value of the counter of a thread is zero, to transition that thread to the waiting to be processed state. Again, preferably, the checking and transitioning are performed by the thread manager. The checking may be carried out once every clock tick.

Preferably, the processor is arranged to process any number of threads up to the plurality of threads, such that, at any one time, zero, one or more of the plurality of thread locations are empty.

In one embodiment, the or each thread has a plurality N of respective counters associated therewith and the means for checking the value of the counter associated with the thread comprises: means for checking the value of at least one of the N counters associated with the thread, before each hazard instruction is processed; and i) if all of the values of the at least one counters are zero, processing the hazard instruction, or ii) if one or more of the values of the at least one counters are non-zero, pausing processing of the hazard instruction until a later time.

In that case, each long-latency instruction may include an indication of which of the at least N counters should be incremented before the long-latency instruction is processed and decremented after the long-latency instruction is processed.

Also, each hazard instruction may be preceded by an instruction that includes an indication of which of the N counters should be checked before the hazard instruction is processed. Alternatively, each hazard instruction may include an indication of which of the N counters should be checked before the hazard instruction is processed.

Features described in relation to one aspect of the invention may also be applicable to the other aspect of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

Exemplary embodiments of the invention will now be described with reference to the accompanying drawings, of which:

FIG. 1 is a diagram of a multi-threaded processor according to an embodiment of the invention;

FIG. 2 is a schematic diagram of the thread manager 101 of FIG. 1;

FIG. 3 is a schematic diagram showing transitioning between the four possible states of a thread;

FIG. 4 is a schematic diagram of a 32-bit long latency memory load instruction; and

FIG. 5 is a schematic diagram of a 32-bit WHC instruction.

DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT

FIG. 1 shows a multi-threaded processor according to one embodiment of the invention. The processor 100 comprises a thread manager 101 which receives submissions of threads to be processed. The thread manager 101 is connected to an instruction fetcher 103, which can fetch the appropriate instruction, as indicated by the thread manager 101, from an external memory 105 or internal cache 107. The fetched instruction then goes to an instruction decoder 109. The register store 111 forwards the instructions for execution and the result is returned and input into the register store. The register store also fetches the source arguments to be sent down to instruction execution.

In this embodiment, the multi-threaded processor can deal with two categories of instructions. The first category comprises low latency (or short latency) instructions, where the instruction result is written back to the register store in a small, deterministic number of clock ticks. These instructions execute within the processor 100. Examples of such instructions are simple addition and multiplication. These low-latency instructions are relatively easy to handle and are shown at loop 113 in FIG. 1.

The second category comprises high-latency (or long latency) instructions, where the instruction result is written back to the register store from an external unit with a variable and unpredictable latency, which could be as large as dozens of clock ticks. These are shown in FIG. 1 at loop 115. These high latency instructions send processing requests to external units 117 such as memory interfaces, texture sampling units and math co-processors. Note that, in FIG. 1, the units are shown to be outside the processor module 100, but this is not necessarily the case. They are tightly coupled with the processor 100 and are likely on the same die.

Operation of the processor 100 of FIG. 1 will now be described, firstly for the low-latency instructions, secondly for the high-latency instructions. Operation is performed so as to avoid data hazards. In the processor, once an instruction has begun execution, i.e. it has left thread manager 101 and has been sent to the instruction fetcher, it cannot stop. Any hazards cause the processor pipeline to stall until the data hazard clears.

FIG. 2 is a schematic diagram of the thread manager 101. The multi-threaded processor 100 is configured to have a certain number of resident threads that can simultaneously exist on the processor. In the example shown in FIG. 2, this number is 16. Each of the 16 resident threads has an ID 201. At any given time, each thread is in a particular state 205. The four possible states are “empty”, “ready”, “running” and “hazard” and the four states will be described further below. The values in columns 203 and 207 will also be described below.

From those 16 resident threads, a subset 209 of threads is in the “running” state at any one time. The thread manager 101 executes one instruction from each running thread in a round robin manner on a clock tick by clock tick basis (see 211). The number of threads in the subset that can be in the “running” state at any one time is equal to the largest number of clock ticks it takes to process the low latency instructions. Because of this, it is guaranteed that the processing of a particular instruction is completed within a single round-robin cycle, so that a later instruction in the same thread cannot be processed until the earlier instruction has been processed. In the example in FIG. 2, the number in the “running” subset 209 is equal to 4. Thus, as long as there are four resident threads running there will be no data hazard penalties on low latency instructions.
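The arithmetic behind this guarantee can be checked with a small model. The sketch below is illustrative only and rests on stated assumptions: one instruction issues per clock tick, the “running” subset is full, every instruction depends on the previous instruction of its own thread, and a result issued at tick t is available from tick t plus the latency. The function name `first_hazard` is hypothetical.

```python
def first_hazard(num_running, max_short_latency, rounds=3):
    """Return (thread, tick) of the first short-latency hazard, or None.

    Threads issue one instruction each, in strict round robin order,
    one instruction per clock tick.
    """
    result_ready_at = {}  # thread -> tick at which its previous result is written
    tick = 0
    for _ in range(rounds):
        for thread in range(num_running):
            # A hazard arises if the thread issues again before its
            # previous result has been written back.
            if result_ready_at.get(thread, 0) > tick:
                return (thread, tick)
            result_ready_at[thread] = tick + max_short_latency
            tick += 1
    return None


print(first_hazard(num_running=4, max_short_latency=4))  # None
print(first_hazard(num_running=3, max_short_latency=4))  # (0, 3)
```

With four running threads and a four-tick worst case, each thread issues again exactly when its previous result becomes available. With only three running threads, thread 0 would issue one tick too early.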

Referring once again to FIG. 2, the thread manager 101 keeps track of each resident thread's state. As already discussed, the “running” state means that the thread is currently being executed. The “empty” state means that there is no resident thread present for this ID 201. The “ready” state means that the thread is ready to be executed as soon as there is space in the “running” subset 209. The “hazard” state will be described below.

The transitioning between the four states is shown schematically in FIG. 3. Firstly, let's consider the transition 301 from “empty” to “ready”. This happens when a new thread is submitted and it then waits in the resident threads to enter the “running” subset 209.

Now let's consider the transition from “ready” to “running” 303. When a space becomes available in the “running” subset (which may happen when a running thread finishes its final instruction, thereby freeing up space in the “running” subset), the thread manager selects one of the “ready” threads and transitions it to the “running” state. This is known as scheduling a thread. The thread manager may use an algorithm to select the ready thread for transition from the resident ready threads; for example, it may use a simple round-robin approach.

Now let's consider the transition 305 from “running” to “empty”. The instruction decoder can tell when it receives the final instruction of a particular thread. When that happens, it tells the thread manager to move the state of that thread from “running” to “empty” by communicating along the interface 119 (see FIG. 1) with a command that includes the thread ID 201 and a command to make the thread “empty”. This will then free up space in the “running” subset so that another ready resident thread can be transitioned to “running” 303.

Finally, let's consider the transition 307 from “running” back to “ready”. This might happen if the processor supports a timer such that the number of clock ticks that any given thread could remain running was limited. Then, if a thread exceeded this threshold, it would transition back to the “ready” state from the “running” state. This could be achieved via a command from the instruction decoder to the thread manager via the interface 119 in FIG. 1, the command including the thread ID 201 and a command to make the thread “ready”. This type of mid-thread transition is known as de-scheduling a thread. Because the thread has not been completed, it transitions to “ready” rather than “empty” and the thread will need to be scheduled again at a later point in order to complete. The act of de-scheduling a thread will cause the thread manager to run an algorithm to choose one of the “ready” threads to schedule, because a space will be available in the “running” subset.

Transitions 309 and 311 to and from the hazard state will be described below.

Note that the thread manager can transition thread states in a single clock tick since they are resident.
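The state transitions of FIG. 3 can be summarised as a small table. The following is a sketch for illustration only; the reference numerals are those used in this description (309 and 311 being the transitions into and out of the “hazard” state), while the names `State`, `TRANSITIONS` and `transition` are hypothetical.

```python
from enum import Enum


class State(Enum):
    EMPTY = "empty"
    READY = "ready"
    RUNNING = "running"
    HAZARD = "hazard"


# Permitted (from, to) pairs mapped to the reference numerals of FIG. 3.
TRANSITIONS = {
    (State.EMPTY, State.READY): 301,     # new thread submitted
    (State.READY, State.RUNNING): 303,   # scheduling
    (State.RUNNING, State.EMPTY): 305,   # final instruction processed
    (State.RUNNING, State.READY): 307,   # de-scheduling
    (State.RUNNING, State.HAZARD): 309,  # paused on non-zero hazard counters
    (State.HAZARD, State.READY): 311,    # hazard counters back to zero
}


def transition(current, target):
    """Return the new state, or raise if FIG. 3 shows no such transition."""
    if (current, target) not in TRANSITIONS:
        raise ValueError(f"no transition from {current.value} to {target.value}")
    return target


state = transition(State.EMPTY, State.READY)
state = transition(state, State.RUNNING)
print(state.value)  # running
```

Note that, in this table, a thread leaving the “hazard” state returns to “ready” and must be scheduled again before it runs.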

The description above relates to short latency instructions only, when data hazards can be avoided relatively easily by selecting the number of threads permitted to be in the “running” subset. The processing of long latency instructions will now be described.

To absorb the latency of long latency instructions, two features are used: hazard counters and the hazard state.

Referring to FIG. 2, column 207 stores the hazard counter values for each thread. In this example there are three hazard counters and they can each take any value from 0 to 7. However, the number of hazard counters may be different and/or the number of values each hazard counter can take may also be different. The hazard counter values are actually stored in the instruction decoder but the thread manager has access to the hazard counter values.

The instruction decoder can tell whether the instruction it has just received is a short latency instruction or a long latency instruction. If the instruction is a long latency instruction, the instruction decoder automatically increments one of the hazard counters (more on which one later) by one. Then, it sends the instruction for execution as it would for any short latency instruction. The nominated hazard counter, as well as the thread ID, are passed down the processor pipeline for these long latency instructions so that, when the data is finally written to the destination register, the hazard counter and the thread ID are present on the interface.

As the data is being written to the register store 111, the instruction decoder looks at the hazard counter and thread ID (see 121 in FIG. 1) and decrements the hazard counter for the particular thread so that the loop is closed.

An example of such a long latency instruction is shown in FIG. 4. The instruction shown in FIG. 4 is a 32 bit memory load instruction and it comprises 24 bits of payload, 6 bits of operation instruction and 2 bits indicating the hazard counter to be incremented.
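A possible packing of these three fields into a 32-bit word is sketched below. The field order is an assumption, since FIG. 4 is not reproduced here: the payload is placed in the 24 least significant bits, the hazard counter field in the next 2 bits and the operation field in the top 6 bits. This ordering agrees with the later statement that the hazard counter occupies bit numbers 25 and 26, provided bits are numbered from 1 at the least significant end. The function names `encode_load` and `decode_load` are hypothetical.

```python
PAYLOAD_BITS = 24
HC_BITS = 2
OP_BITS = 6


def encode_load(op, hc, payload):
    """Pack the three fields into one 32-bit word (assumed field order)."""
    if not 0 <= op < (1 << OP_BITS):
        raise ValueError("operation field does not fit in 6 bits")
    if not 0 <= hc < (1 << HC_BITS):
        raise ValueError("hazard counter field does not fit in 2 bits")
    if not 0 <= payload < (1 << PAYLOAD_BITS):
        raise ValueError("payload does not fit in 24 bits")
    return (op << (PAYLOAD_BITS + HC_BITS)) | (hc << PAYLOAD_BITS) | payload


def decode_load(word):
    """Unpack (op, hc, payload) from a 32-bit word."""
    payload = word & ((1 << PAYLOAD_BITS) - 1)
    hc = (word >> PAYLOAD_BITS) & ((1 << HC_BITS) - 1)
    op = (word >> (PAYLOAD_BITS + HC_BITS)) & ((1 << OP_BITS) - 1)
    return op, hc, payload


word = encode_load(op=0x2A, hc=1, payload=0x000123)
print(decode_load(word) == (0x2A, 1, 0x000123))  # True
```

A 2-bit field can name up to four hazard counters, which is sufficient for the three counters of this example.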

So, as long latency instructions are fed into the instruction decoder, the instruction decoder increments an appropriate hazard counter. When the long latency instructions are completed, the instruction decoder decrements the hazard counter. Thus, the hazard counter of any thread will depend on how many of its long latency instructions are currently running. Thus, if there are no outstanding destination register writes, the hazard counter will be zero, but if there are any outstanding destination register writes, the hazard counter will be non-zero. Thus, referring to FIG. 2, we know that thread ID 11 has two writes currently outstanding, since its first hazard counter value is 2.
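The closed loop between decode and write-back might be modelled as follows. This is a sketch under assumptions: the class `DecoderCounters`, its methods and the representation of the tag as a `(thread_id, hc)` pair are all hypothetical, and counter saturation is ignored here.

```python
from collections import defaultdict

NUM_COUNTERS = 3  # three hazard counters per thread, as in the example


class DecoderCounters:
    """Hypothetical model of the hazard counters held in the instruction decoder."""

    def __init__(self):
        # thread ID -> list of hazard counter values
        self.counters = defaultdict(lambda: [0] * NUM_COUNTERS)

    def decode_long_latency(self, thread_id, hc):
        # Increment the nominated counter, then return the tag that travels
        # down the pipeline with the long latency instruction.
        self.counters[thread_id][hc] += 1
        return (thread_id, hc)

    def on_register_write(self, tag):
        # The tag is present on the interface when the data is written to
        # the register store, so the same counter can be decremented.
        thread_id, hc = tag
        self.counters[thread_id][hc] -= 1


decoder = DecoderCounters()
first = decoder.decode_long_latency(thread_id=11, hc=0)
second = decoder.decode_long_latency(thread_id=11, hc=0)
print(decoder.counters[11])  # [2, 0, 0]: two writes outstanding
decoder.on_register_write(second)  # results may return in any order
print(decoder.counters[11])  # [1, 0, 0]
```

Because the tag identifies only a thread and a counter, the results may return in any order without the decoder tracking individual instructions.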

The number of long latency instructions that can be running at any one time depends on the number of hazard counters and the maximum value that those hazard counters can each take. In this case, 21 (3×7) long latency instructions can be running for a particular thread. If a further long latency instruction were to enter the instruction decoder, the processor would stall until one of the currently running long latency instructions had been completed, thus decreasing the hazard counter, at which time the processor could restart.
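The capacity calculation and the stall condition can be expressed as a short sketch. It assumes that the stall is triggered when the counter nominated by the incoming instruction is already at its maximum, which is one reading of the passage above; on that reading the total of 21 is reached only if the long latency instructions are spread evenly across the three counters. The names used are hypothetical.

```python
NUM_COUNTERS = 3
COUNTER_MAX = 7  # each hazard counter takes values 0 to 7


def max_outstanding():
    """Largest number of long latency instructions in flight for one thread."""
    return NUM_COUNTERS * COUNTER_MAX


def try_increment(counters, hc):
    """Increment the nominated counter, or return False if the decoder must stall."""
    if counters[hc] >= COUNTER_MAX:
        return False  # saturated: wait for an outstanding instruction to complete
    counters[hc] += 1
    return True


counters = [7, 0, 0]
print(max_outstanding())           # 21
print(try_increment(counters, 0))  # False: counter 0 is saturated
print(try_increment(counters, 1))  # True: counter 1 has room
```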

In the preferred embodiment, a WHC (Wait for Hazard Counters) instruction precedes an instruction which depends upon an earlier instruction being completed before it can be implemented correctly (i.e. if it were processed first there might be a data hazard). (An alternative approach would be to include WHC bits in the instruction itself and this possibility is discussed further below.) When the instruction decoder receives a WHC instruction, it knows it must check the hazard counters for the thread or threads indicated in the WHC instruction. This is because the WHC instruction precedes an instruction (or instructions) that depends on an earlier instruction having been completed. If there is a non-zero hazard counter for that thread or threads, we know that there are outstanding register writes. When the instruction decoder checks the hazard counters for that thread, if the hazard counters are zero, nothing happens and the instruction decoder will proceed to the next instruction for execution as normal. (The WHC instruction itself just dies in the instruction decoder since it has performed its purpose.) On the other hand, if the hazard counters are non-zero, the instruction decoder tells the thread manager to transition the thread into the “hazard” state (transition 309 in FIG. 3) via the interface 119 (FIG. 1).

The WHC instruction does not need to immediately precede the instruction which depends on an earlier instruction being completed before it can be implemented correctly. As long as the WHC instruction is somewhere between the long latency instruction and the instruction which depends on completion of that long latency instruction, the WHC instruction will achieve its purpose. Also a single WHC instruction can “enable” lots of later instructions, for example where a single long latency instruction's result is later used by many instructions. The later instructions will be indicated in the WHC instruction.

When a thread has been moved from the “running” state to the “hazard” state, a space becomes available in the “running” subset 209. Another resident thread which is in the “ready” state can then be scheduled, i.e. moved to the “running” state, transition 303, so as to absorb the latency of the thread while it remains in the “hazard” state. The larger the gap between the number of resident threads (in this case 16) and the number of threads permitted in the “running” subset (in this case 4), the more likely it will be that the thread manager finds a thread in the “ready” state which can be scheduled to absorb the latency of the “hazard” thread.

But once the thread is in the hazard state, how does it transition out of the hazard state (i.e. transition 311)? Every clock tick, the thread manager looks at the hazard counters of any threads that are in the hazard state. Remember that the hazard counters will decrease as long latency instructions are being completed. If the hazard counters are still non-zero, the thread manager takes no action and leaves the thread in the “hazard” state. If the hazard counters have, by now, reduced to zero, the thread can be transitioned from the “hazard” state to the “ready” state by the thread manager. This is the transition 311 in FIG. 3. Thus, once the hazard counters have decreased to zero, we know that there are no outstanding destination register writes. Thus there is no chance that the WHC instruction will be processed out of order, thereby causing a data hazard.
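One clock tick of this check might look as follows in a software model. The sketch assumes that the thread manager remembers, for each thread in the “hazard” state, which counters the WHC instruction selected; the description does not state how this is recorded, so the `wait_masks` argument is hypothetical, as is the function name `poll_hazard_threads`.

```python
def poll_hazard_threads(states, counters, wait_masks):
    """Move each "hazard" thread whose selected counters are all zero to "ready".

    states:     thread ID -> "empty", "ready", "running" or "hazard"
    counters:   thread ID -> list of hazard counter values
    wait_masks: thread ID -> bit mask, bit i selecting counter i (assumption)
    """
    for thread_id, state in states.items():
        if state != "hazard":
            continue
        selected = [
            value
            for index, value in enumerate(counters[thread_id])
            if wait_masks[thread_id] & (1 << index)
        ]
        if all(value == 0 for value in selected):
            states[thread_id] = "ready"  # transition 311
        # Otherwise take no action: the thread stays in "hazard".


states = {3: "hazard", 5: "hazard", 7: "running"}
counters = {3: [0, 1, 0], 5: [1, 0, 0], 7: [2, 0, 0]}
wait_masks = {3: 0b001, 5: 0b001, 7: 0b000}
poll_hazard_threads(states, counters, wait_masks)
print(states)  # {3: 'ready', 5: 'hazard', 7: 'running'}
```

Thread 3 is released even though its second counter is non-zero, because in this model it was waiting only on its first counter.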

In actual fact, the thread manager does not need to look at the hazard counters of the “hazard” threads at each and every clock tick. As long as the thread eventually makes the transition from the “hazard” state to the “ready” state, the checking can be done every few clock ticks. Indeed, this would be advantageous when working at high frequencies and it could also reduce gate count.

It was discussed earlier that there may be several hazard counters and the instruction decoder nominates one to increment for each long latency instruction. In this example, there are three hazard counters, each of which can take a value from 0 to 7.

FIG. 5 shows a WHC instruction (i.e. one that precedes an instruction which depends on earlier instructions). The WHC instruction includes three bits, each of which corresponds to a hazard counter. If any of the bits is non-zero, this is an indication that the corresponding hazard counter should be checked before the next instruction is implemented. The compiler should choose which hazard counter to assign to each long latency instruction in a manner that optimises the performance of the latency absorption.
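The decoder's decision on receiving a WHC instruction might be sketched as follows. The mapping of mask bit i to hazard counter i is an assumption, as the position of the three bits within the instruction of FIG. 5 is not given here; the function name `decode_whc` is hypothetical.

```python
NUM_COUNTERS = 3


def decode_whc(counters, whc_mask):
    """Return "continue" if every selected counter is zero, else "hazard".

    counters: the thread's hazard counter values
    whc_mask: three-bit field, bit i selecting counter i (assumption)
    """
    for index in range(NUM_COUNTERS):
        selected = whc_mask & (1 << index)
        if selected and counters[index] != 0:
            return "hazard"  # decoder asks the thread manager for transition 309
    return "continue"        # the WHC instruction is consumed and decoding proceeds


print(decode_whc([0, 2, 0], 0b001))  # continue: only counter 0 is checked
print(decode_whc([0, 2, 0], 0b010))  # hazard: counter 1 is still non-zero
```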

The compiler, which produces the instructions executed by the processor, should be optimised to place the WHC instruction immediately before the instruction which uses the result of the long latency instruction. Thus, the WHC indication is received before the instruction is processed. Also, the compiler should try to position the WHC instructions in the thread as far as possible from the long-latency instructions which increment the corresponding counters. This will minimise the number of threads in the hazard state. This may be done by the compiler reorganising the order of the instructions.

For example, a WHC instruction might not depend on all long latency instructions having been processed before it is processed, but simply some of the long latency instructions in its thread. In that case, the long latency instructions on which the WHC instruction depends can be nominated to increment, say, the first hazard counter. On the other hand, long latency instructions on which the WHC instruction does not depend can be nominated to increment, say, the second hazard counter. Then, in the WHC instruction, the bit corresponding to the first hazard counter will be non-zero, whereas the bit(s) corresponding to the other hazard counter(s) will be zero. Then, the instruction decoder knows that the only hazard counter that needs to be checked is the first hazard counter, since the other hazard counters are not relevant to this particular WHC instruction. Referring to FIG. 4, we see that the hazard counters to increment are indicated at bit numbers 25 and 26.

Consider the example of an instruction stream below, which is a good usage of the hazard counters. Note that “Load” instructions are long latency instructions in the following examples and “Add” instructions are short latency instructions.

1) Load register location (r) 0 and increment hazard counter (HC) 0;
2) Load r 1 and increment HC 0;
3) Load r 2 and increment HC 1;
4) Load r 3 and increment HC 1;
5) Load r 4 and increment HC 2;
6) Load r 5 and increment HC 2;
7) Wait for HC 0 to decrease to value 0;
8) Add r 0 and r 1 and put in r 9;
9) Wait for HC 1 to decrease to value 0;
10) Add r 2 and r 3 and put in r 10;
11) Wait for HC 2 to decrease to value 0;
12) Add r 4 and r 5 and put in r 11.

Contrast this with the following instruction stream, which does not make good usage of the hazard counters:

1) Load r 0 and increment HC 0;
2) Load r 1 and increment HC 0;
3) Load r 2 and increment HC 1;
4) Load r 3 and increment HC 1;
5) Load r 4 and increment HC 2;
6) Load r 5 and increment HC 2;
7) Wait for HC 0, HC 1 and HC 2 to decrease to value 0;
8) Add r 0 and r 1 and put in r 9;
9) Add r 2 and r 3 and put in r 10;
10) Add r 4 and r 5 and put in r 11.

In the second example, all three hazard counters must reduce to zero before any of the subsequent instructions can be implemented. But this is not particularly efficient since the instruction to put the sum of r 0 and r 1 into r 9, for example, only depends on r 0 and r 1 and not on the other register locations. So, the thread will be in the “hazard” state for longer. The first example makes much better use of the hazard counters. By dividing up the long latency instructions across the available hazard counters, instructions can start earlier thereby providing a running thread to absorb hazards on other threads. The compiler knows the difference between long latency instructions and short latency instructions and also knows which instructions depend on which earlier instructions, so can make efficient use of the hazard counters.
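The difference between the two instruction streams can be illustrated with a small single-thread model. It is a sketch under stated assumptions only: one instruction issues per clock tick, a wait blocks until every load that incremented the named counters has returned, and the return ticks in `DONE` are invented for illustration. The function name `simulate` and the tuple encoding of instructions are hypothetical.

```python
def simulate(stream, load_done_tick):
    """Issue one instruction per tick and return {destination: tick its add issued}.

    stream entries: ("load", reg, hc), ("wait", (hc, ...)) or ("add", dest).
    load_done_tick maps each loaded register to a hypothetical return tick.
    """
    tick = 0
    in_flight = []   # (return tick, hazard counter) for every load issued so far
    add_tick = {}
    for op in stream:
        kind = op[0]
        if kind == "load":
            _, reg, hc = op
            in_flight.append((load_done_tick[reg], hc))
        elif kind == "wait":
            _, hcs = op
            # A counter reaches zero once all loads that incremented it have returned.
            pending = [done for done, hc in in_flight if hc in hcs]
            tick = max([tick] + pending)
        else:
            _, dest = op
            add_tick[dest] = tick
        tick += 1
    return add_tick


# r 0, r 1 -> HC 0; r 2, r 3 -> HC 1; r 4, r 5 -> HC 2, as in both streams above.
LOADS = [("load", r, r // 2) for r in range(6)]
DONE = {0: 20, 1: 22, 2: 40, 3: 42, 4: 60, 5: 62}   # hypothetical return ticks

good = LOADS + [
    ("wait", (0,)), ("add", "r9"),
    ("wait", (1,)), ("add", "r10"),
    ("wait", (2,)), ("add", "r11"),
]
poor = LOADS + [
    ("wait", (0, 1, 2)), ("add", "r9"), ("add", "r10"), ("add", "r11"),
]

good_ticks = simulate(good, DONE)
poor_ticks = simulate(poor, DONE)
print(good_ticks["r9"], poor_ticks["r9"])
```

With these assumed return ticks, the add into r 9 issues much earlier in the first stream than in the second, because it no longer waits for the loads into r 2 to r 5.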

In the FIG. 4 example, the hazard counters are nominated in the long-latency instruction. However, this does not need to be the case and an alternative method would be to tie the hazard counter number to the least significant bits of the destination register address. So, for example, any loads into register location 0 will increment hazard counter 0, but any loads into register location 5 will increment hazard counter 1. This conserves valuable instruction encoding space, but makes it difficult to implement complex addressing modes such as indexing, where the actual destination register address is not known by the compiler or the instruction decoder.
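The alternative mapping can be sketched as a simple mask of the destination register address. The function name `counter_for_register` and the parameter `lsb_bits` are hypothetical, and the number of least significant bits used is an assumption; both one bit and two bits are consistent with the example above (register location 0 gives hazard counter 0, register location 5 gives hazard counter 1).

```python
def counter_for_register(dest_reg: int, lsb_bits: int = 2) -> int:
    """Select a hazard counter from the low bits of the destination register address.

    lsb_bits is an assumed parameter; the description does not fix its value.
    """
    return dest_reg & ((1 << lsb_bits) - 1)


print(counter_for_register(0))              # register location 0 -> hazard counter 0
print(counter_for_register(5))              # register location 5 -> hazard counter 1
print(counter_for_register(5, lsb_bits=1))  # the same result with a single bit
```

Note that with two bits the mask can produce the value 3, so a system with only three hazard counters would need some further rule for that case; the description does not specify one.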

In this example, the WHC instruction supports checking of more than one hazard counter at a time. That is, in the example of FIG. 5, the WHC supports checking of three hazard counters since there are three WHC bits in the WHC instruction. However, this does not necessarily need to be the case and it could be that each WHC instruction simply nominates a single hazard counter which should be checked. In that case a first WHC instruction could nominate HC 1 to check and a second WHC instruction could nominate HC 2 to check, and so on. This would save a small number of gates.

A number of general points should be noted regarding the above embodiment. The described embodiment includes 16 resident threads, of which up to 4 are in the “running” subset at any one time. But the system also scales to very high clock frequencies. As the clock frequency increases, the latency, in terms of clock ticks, of both short latency and long latency instructions increases. This means that the number of threads in the “running” subset can be increased to satisfy the new, higher latency of short latency instructions. And, the number of resident threads can also be increased to compensate.

Other points about the described embodiments are as follows:

Before a thread can terminate, the hazard counters must all be zero. (This is fairly clear because a non-zero hazard counter means that there is an outstanding instruction.)

The instruction decoder stores the hazard counters for all the threads. The hazard counters can be non-zero when the thread is in the hazard state. As soon as the hazard counters are zero, at the next check, the thread manager will transition the thread to the ready state.

In some cases, the hazard counters can be non-zero in the running state. This occurs when a long latency instruction is executed (which increments the hazard counter) but before a WHC instruction has arrived. Or, even when a WHC instruction has arrived, it may not correspond to that hazard counter, so the thread will remain in the running state.
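The thread states discussed in the preceding points can be modelled roughly as follows. This is a sketch only: the names `State`, `Thread` and `thread_manager_check` are hypothetical, and it assumes that a paused thread is released when the counters nominated by its WHC instruction are zero.

```python
from enum import Enum


class State(Enum):
    READY = "ready"
    RUNNING = "running"
    HAZARD = "hazard"


class Thread:
    def __init__(self, num_counters=3):
        self.state = State.RUNNING
        self.counters = [0] * num_counters
        self.awaited = ()          # counters named by the WHC that paused the thread

    def long_latency_issued(self, hc):
        self.counters[hc] += 1     # the thread keeps running

    def long_latency_returned(self, hc):
        self.counters[hc] -= 1

    def whc(self, hcs):
        # Only the nominated counters are checked; others may remain non-zero.
        if any(self.counters[hc] != 0 for hc in hcs):
            self.state = State.HAZARD
            self.awaited = tuple(hcs)


def thread_manager_check(threads):
    """Periodic check: hazard threads whose awaited counters are zero become ready."""
    for t in threads:
        if t.state is State.HAZARD and all(t.counters[hc] == 0 for hc in t.awaited):
            t.state = State.READY
            t.awaited = ()


t = Thread()
t.long_latency_issued(1)
t.whc([0])                         # counter 1 is non-zero but is not nominated
running_with_nonzero = t.state is State.RUNNING
t.long_latency_issued(0)
t.whc([0])                         # the nominated counter is now outstanding
thread_manager_check([t])
paused = t.state is State.HAZARD
t.long_latency_returned(0)
thread_manager_check([t])          # the thread becomes ready at the next check
```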

The example described above with reference to FIGS. 1 to 5 shows hardware for performing the method. The method of the invention could of course alternatively be implemented in software.

An alternative, efficient use of the hazard counters is as follows. In SIMD processors, one long-latency instruction might kick start several register writes via an external unit. A problem might be encountered with this because the instruction decoder might only increment the hazard counter by 1 at the outset but might try to decrement the hazard counter several times at the end, once for each register write. We can cope with this by passing a flag around the pipeline which indicates just the final register write for the instruction. When the instruction decoder sees this flag, it will know to decrement the hazard counter. So, each long latency instruction only increments the hazard counter by one, even if there are multiple register writes. It is possible that the hazard counters could be used in a different way. For example, the hazard counter could be incremented at the outset, once for each register write rather than once for the entire long latency instruction. Then, no flag would be needed because the instruction decoder would just decrement the hazard counter by one, once each register write is completed. This way of using hazard counters is equally valid but does mean that the hazard counters are used up more quickly, increasing the likelihood of a stall for fewer long latency instructions in the pipeline.
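The two accounting schemes for a SIMD long latency instruction can be compared with a short sketch. The function name `simulate_simd` and its arguments are hypothetical; the model simply tracks one hazard counter for one instruction that produces several register writes.

```python
def simulate_simd(num_writes: int, per_write: bool):
    """Return (peak counter value, final counter value) for one SIMD instruction.

    per_write=False: increment once, decrement only on the flagged final write.
    per_write=True:  increment once per register write, decrement on every write.
    """
    counter = num_writes if per_write else 1   # increment(s) at the outset
    peak = counter
    for i in range(num_writes):
        is_final_write = (i == num_writes - 1)  # the flag passed around the pipeline
        if per_write or is_final_write:
            counter -= 1
    return peak, counter


print(simulate_simd(4, per_write=False))  # flagged scheme
print(simulate_simd(4, per_write=True))   # per-write scheme
```

Both schemes return the counter to zero, but the per-write scheme reaches a higher peak value, which is why it uses up the range of the hazard counter more quickly.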

In addition, it is possible to use the WHC instructions in a slightly different way. The functionality of the WHC instruction could be integrated into all the “regular” instructions thereby eliminating it as an additional instruction. In this case, there are bits in the instruction encoding which tell the instruction decoder which hazard counters, if any, to wait for. These bits do not need to be on every instruction, but for those which do not have them, they cannot be configured to wait for data to return from long-latency instructions. The benefit is a reduction in the size of programs and instruction bandwidth but at the cost of instruction encoding bits.

In the introduction, a number of requirements were set out which the invention had to meet while solving the problems.

Firstly, in a multi-threaded processor, many threads might have potential data hazards. This is satisfied in the described embodiment, since, of the 16 resident threads, up to 12 can be in the hazard state while 4 are still running. In different embodiments (with a different number of threads permitted to be resident and a different number permitted to be in the running subset), this would also work.

Secondly, each thread might have a large number of long latency instructions. This is also satisfied here by the number of hazard counters and the range of values that each hazard counter can take. In the described embodiment, the hardware provides three hazard counters, each of which can run from value 0 to value 7. Therefore, each resident thread can guarantee that 21 long latency instructions can be outstanding and undergoing processing, before the processor stalls. If we knew that the number of long latency instructions was particularly high, we could increase the number of hazard counters or the range of values that each could take or, if we thought it sufficient, the number of hazard counters could be decreased. In a preferred embodiment used in practice, only two hazard counters are used, each taking a value between 0 and 15.

It also must be possible for the return data from long latency instructions to come back in a different order from that in which they were dispatched. This is satisfied by the following instruction stream which may be executed:

1) Load r 0 and increment HC 0 (long latency Load instruction);
2) Sample r 1 and increment HC 0 (this is a long latency sampling instruction);
3) Wait for HC 0 to decrease to value 0;
4) Add r 0 and r 4 and put in r 2 (short latency Add instruction);
5) Add r 1 and r 5 and put in r 3.

In this example, which would work with this invention, r 1 can be written by the texture sampling unit before r 0 is written to by the memory interface. Both of the instructions only decrement HC 0 so the order is not important.
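The order independence can be seen by tracing HC 0 for both possible return orders. The function name `hc0_trace` is hypothetical and the strings naming the returning units are labels only.

```python
def hc0_trace(return_order):
    """Return the successive values of HC 0 as the results come back."""
    hc0 = len(return_order)      # one increment per long latency instruction at decode
    trace = [hc0]
    for _unit in return_order:
        hc0 -= 1                 # every return decrements the same counter
        trace.append(hc0)
    return trace


in_order = hc0_trace(["memory interface (r 0)", "texture sampling unit (r 1)"])
out_of_order = hc0_trace(["texture sampling unit (r 1)", "memory interface (r 0)"])
print(in_order == out_of_order)
```

In either case the counter reaches zero only after both results have returned, which is all that the wait in instruction 3) requires.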

We also must reduce, as much as possible, processor stalling due to data hazards from long latency instructions. The largest gains are expected to be due to threads getting de-scheduled on WHC instructions because fresh threads will be scheduled, which hopefully will run for quite some time before they encounter any WHC instructions themselves. In addition, the WHC instruction does not have to come right after the long latency instruction. It should come immediately before the data is sourced. Useful instructions can be inserted in between the long latency instruction and the WHC which will further absorb latency. By having multiple hazard counters, time spent waiting for data that is not actually required is reduced. For example, consider the following instruction stream in a system with only two hazard counters, HC 0 and HC 1:

1) Load r 0 and increment HC 0 (long latency Load instruction);
2) Load r 1 and increment HC 0;
3) Load r 2 and increment HC 1;
4) Wait for HC 0 to decrease to value 0 (i.e. wait for r 0 and r 1);
5) Add r 0 and r 10 and put in r 3 (short latency Add instruction);
6) Add r 3 and r 11 and put in r 4;
7) Add r 4 and r 12 and put in r 5;
8) Add r 5 and r 1 and put in r 6 (first time r 1 is used);
9) Add r 6 and r 13 and put back in r 6;
10) Add r 6 and r 14 and put back in r 6;
11) Add r 6 and r 15 and put back in r 6;
12) Wait for HC 1 to decrease to value 0;
13) Add r 2 and r 6 and put back in r 6.

In this example the latency absorption of a system with two counters is worse than one with three counters.
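The reason can be made concrete with a small calculation. With two counters, r 0 and r 1 share HC 0, so the wait at instruction 4) is not released until r 1 has returned, although r 1 is not needed until instruction 8). The sketch below uses invented return ticks, and the three-counter assignment shown (one counter per load) is an assumption about how a compiler might use a third counter; the function name `release_tick` is hypothetical.

```python
def release_tick(counter_of, done_tick, awaited_counter):
    """Tick at which a wait on awaited_counter is released.

    counter_of maps each loaded register to the hazard counter it increments;
    done_tick maps each loaded register to a hypothetical return tick.
    """
    return max(
        done_tick[reg] for reg, hc in counter_of.items() if hc == awaited_counter
    )


DONE = {0: 10, 1: 30, 2: 50}            # hypothetical return ticks for r 0, r 1, r 2

two_counters = {0: 0, 1: 0, 2: 1}       # r 0 and r 1 share HC 0, as in the stream above
three_counters = {0: 0, 1: 1, 2: 2}     # assumed assignment: one counter per load

print(release_tick(two_counters, DONE, 0))    # the wait on HC 0 also waits for r 1
print(release_tick(three_counters, DONE, 0))  # the wait on HC 0 waits for r 0 only
```

Under these assumptions the adds in instructions 5) to 7) could begin as soon as r 0 returns in the three-counter case, with a separate wait for r 1 placed immediately before instruction 8).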

It has to be possible to skip over any instructions in the thread so as to support flow control such as branches. This is satisfied here since every long latency instruction increments the hazard counter when it is decoded and decrements the hazard counter when it is written to the destination register. Regardless of whether there is a WHC instruction or not, this incrementing and decrementing always happens and the coherency of the hazard counters is always maintained. A thread cannot terminate until all the hazard counters for the thread are zero.
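The following sketch illustrates this point for a thread in which a branch skips over the WHC instruction. The class name `ThreadCounters` and its method names are hypothetical.

```python
class ThreadCounters:
    def __init__(self, num_counters=3):
        self.counters = [0] * num_counters

    def decode_long_latency(self, hc):
        self.counters[hc] += 1        # always happens at decode

    def destination_written(self, hc):
        self.counters[hc] -= 1        # always happens at write-back

    def can_terminate(self):
        return all(c == 0 for c in self.counters)


t = ThreadCounters()
t.decode_long_latency(0)              # Load r 0 and increment HC 0
# A branch is taken here and the WHC instruction for HC 0 is never decoded.
before_return = t.can_terminate()     # the load is still outstanding
t.destination_written(0)              # r 0 is written; HC 0 returns to zero
after_return = t.can_terminate()
print(before_return, after_return)
```

The counter returns to zero without any WHC instruction having been processed, and the thread is prevented from terminating while the load is outstanding.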

It must also be possible to read results in a different order than they were written. The following example illustrates this point:

1) Load r 0 and increment HC 0;
2) Load r 1 and increment HC 0;
3) Wait for HC 0 to decrease to value 0;
4) Add r 1 and r 4 and put in r 2;
5) Add r 0 and r 5 and put in r 3.

In this example, the r 0 load is carried out before the r 1 load but the r 0 add is carried out after the r 1 add.

It also must be possible for there to be multiple read access for destinations. The following example illustrates this point:

1) Load r 0 and increment HC 0;
2) Wait for HC 0 to decrease to value 0;
3) Add r 0 and r 4 and put in r 2;
4) Add r 0 and r 5 and put in r 3.

In this example, r 0 must be read twice, once for each of instructions 3) and 4).

It also must be permitted for the same destination to be written to and re-used as a destination for another long latency instruction. The following example illustrates this point:

1) Load r 0 and increment HC 0;
2) Wait for HC 0 to decrease to value 0;
3) Add r 0 and r 4 and put in r 2;
4) Load r 0 and increment HC 0;
5) Wait for HC 0 to decrease to value 0;
6) Add r 0 and r 5 and put in r 3.

In this example, r 0 is written to twice.

Finally, it is preferable that no special or extra storage is needed in processing the long latency instructions and potential data hazard instructions. The data returning from the external units does not need to go into special storage, like a FIFO, but it can go straight into the nominated destination register. It is up to the compiler to make sure no instruction writes to the same destination register, otherwise a WHC instruction is required first.

It is also preferable that the gate costs are kept to a minimum. The system is scalable: the number of hazard counters can be adjusted and the maximum hazard counter value can also be adjusted. One thing this affects is how many long latency instructions can be dispatched per thread before stalling occurs due to the hazard counters saturating. The amount of storage per resident thread is minimal: just 12 bits per thread for three hazard counters ranging from 0 . . . 7.
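The effect of the number of counters and their maximum value on saturation can be sketched as follows. The class name `SaturatingHazardCounter` and the function `max_outstanding` are hypothetical; the model assumes that an increment which would exceed the maximum value forces a stall.

```python
class SaturatingHazardCounter:
    def __init__(self, max_value):
        self.max_value = max_value
        self.value = 0

    def try_increment(self):
        """Return False (the issue must stall) if the counter is already at its maximum."""
        if self.value == self.max_value:
            return False
        self.value += 1
        return True


def max_outstanding(num_counters, max_value):
    """Count how many long latency instructions can issue before every counter saturates."""
    counters = [SaturatingHazardCounter(max_value) for _ in range(num_counters)]
    issued = 0
    for c in counters:
        while c.try_increment():
            issued += 1
    return issued


print(max_outstanding(3, 7))    # three counters ranging from 0 to 7
print(max_outstanding(2, 15))   # two counters ranging from 0 to 15
```

For three counters ranging from 0 to 7 this gives the figure of 21 stated earlier; the figure for two counters ranging from 0 to 15 follows from the same arithmetic.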

1. A method for processing a thread in a pipelined processor, the thread comprising a plurality of sequential instructions, the plurality of sequential instructions comprising some short-latency instructions and some long-latency instructions and at least one hazard instruction, the hazard instruction requiring one or more preceding instructions to be processed before the hazard instruction is processed, the method comprising the steps of: a) before processing each long-latency instruction, incrementing by one, a counter associated with the thread; b) after each long-latency instruction has been processed, decrementing by one, the counter associated with the thread; c) before processing each hazard instruction, checking the value of the counter associated with the thread, and i) if the counter value is zero, processing the hazard instruction, or ii) if the counter value is non-zero, pausing processing of the hazard instruction until a later time.
2. A method according to claim 1, wherein the method is for processing a plurality of threads, each thread having a respective counter associated therewith.
3. A method according to claim 1, wherein the or each thread in the processor is, at any one time, either being processed, or waiting to be processed, or paused in accordance with step c) ii) of claim 1.
4. A method according to claim 3, wherein, at any one time, a subset of the plurality of threads is being processed.
5. A method according to claim 4, wherein the method further comprises processing the subset of threads by executing one instruction from each thread in the subset in a round robin manner.
6. A method according to claim 4, wherein the number of threads in the subset is equal to the maximum number of clock ticks required to process a short-latency instruction.
7. A method according to claim 4, further comprising: after processing the final instruction of a thread, removing the thread from the subset of the plurality of threads.
8. A method according to claim 3, further comprising the step of: periodically checking the value of the counter associated with any threads having instructions that have been paused in accordance with step c) ii) of claim 1, and, if the value of the counter of a thread is zero, transitioning that thread to the waiting to be processed state.
9. A method according to claim 8, wherein the step of checking is carried out every clock tick.
10. A method according to claim 2, wherein the processor is arranged to process any number of threads up to the plurality of threads, such that, at any one time, zero, one or more of the plurality of thread locations are empty.
11. A method according to claim 1, wherein the or each thread has a plurality N of respective counters associated therewith and step c) comprises: before processing each hazard instruction, checking the value of at least one of the N counters associated with the thread; and i) if all of the values of the at least one counters are zero, processing the hazard instruction, or ii) if one or more of the values of the at least one counters are non-zero, pausing processing of the hazard instruction until a later time.
12. A method according to claim 11, wherein each long-latency instruction includes an indication of which of the at least N counters should be incremented before the long-latency instruction is processed and decremented after the long-latency instruction is processed.
13. A method according to claim 12, wherein each hazard instruction is preceded by an instruction that includes an indication of which of the N counters should be checked before the hazard instruction is processed.
14. A method according to claim 12, wherein each hazard instruction includes an indication of which of the N counters should be checked before the hazard instruction is processed.
15. A computer program which, when run on computer means, causes the computer means to carry out the method of claim 1.
16. A record carrier having stored thereon a computer program according to claim 15.
17. A pipelined processor for processing a thread, the thread comprising a plurality of sequential instructions, the plurality of sequential instructions comprising some short-latency instructions and some long-latency instructions and at least one hazard instruction, the hazard instruction requiring one or more preceding instructions to be processed before the hazard instruction is processed, the processor comprising: a counter associated with the thread; means to increment the counter by one, before each long-latency instruction is processed; means to decrement the counter by one, after each long-latency instruction has been processed; and means for checking the value of the counter associated with the thread, before each hazard instruction is processed, and i) if the counter value is zero, processing the hazard instruction, or ii) if the counter value is non-zero, pausing processing of the hazard instruction until a later time.
18. A processor according to claim 17, wherein the means to increment the counter and to decrement the counter comprises an instruction decoder, the instruction decoder being able to distinguish between short-latency instructions and long-latency instructions.
19. A processor according to claim 18, further comprising a thread manager and wherein the counter is maintained by the instruction decoder but can be accessed by the thread manager.
20. A processor according to claim 18, wherein the means for checking the value of the counter associated with the thread before a hazard instruction is processed comprises the instruction decoder, the instruction decoder being able to distinguish between the hazard instructions and the remaining instructions.
21. A processor according to claim 17, wherein the processor is suitable for processing a plurality of threads, each thread having a respective counter associated therewith.
22. A processor according to claim 17, wherein the or each thread in the processor is, at any one time, either being processed, or waiting to be processed, or paused in accordance with ii) of claim 17.
23. A processor according to claim 22, wherein, at any one time, a subset of the plurality of threads is being processed.
24. A processor according to claim 23, wherein the processor further comprises means for processing the subset of threads by executing one instruction from each thread in the subset in a round robin manner.
25. A processor according to claim 23, wherein the number of threads in the subset is equal to the maximum number of clock ticks required to process a short-latency instruction.
26. A processor according to claim 23, wherein the processor is arranged, after processing the final instruction of a thread, to remove the thread from the subset of the plurality of threads.
27. A processor according to claim 23, wherein the processor is arranged to periodically check the value of the counter associated with any threads having instructions that have been paused when the counter value is non-zero, and, if the value of the counter of a thread is zero, to transition that thread to the waiting to be processed state.
28. A processor according to claim 27, arranged to check the value of the counter every clock tick.
29. A processor according to claim 21 arranged to process any number of threads up to the plurality of threads, such that, at any one time, zero, one or more of the plurality of thread locations are empty.
30. A processor according to claim 17, wherein the or each thread has a plurality N of respective counters associated therewith and the means for checking the value of the counter associated with the thread comprises: means for checking the value of at least one of the N counters associated with the thread, before each hazard instruction is processed; and i) if all of the values of the at least one counters are zero, processing the hazard instruction, or ii) if one or more of the values of the at least one counters are non-zero, pausing processing of the hazard instruction until a later time.
31. A processor according to claim 30, wherein each long-latency instruction includes an indication of which of the at least N counters should be incremented before the long-latency instruction is processed and decremented after the long-latency instruction is processed.
32. A processor according to claim 31, wherein each hazard instruction is preceded by an instruction that includes an indication of which of the N counters should be checked before the hazard instruction is processed.
33. A processor according to claim 31, wherein each hazard instruction includes an indication of which of the N counters should be checked before the hazard instruction is processed.